This document provides an overview of good research practices for
working with data. It was created as a reference for the
Ohm lab @RPCCC. The primary
source for this document is the “Reproducible Research
Practices” workshop organized by Alex’s Lemonade Stand
Foundation’s Childhood Cancer Data Lab on May 14-15, 2024. All
other references are included at the end of this document.
What is reproducibility? Why does it matter?
- Reproducibility means different things depending on the context. For
us, in the context of bioinformatics and computational oncology,
reproducibility means to provide enough information (data, code,
methods, etc.) to allow someone to replicate the exact analysis and
arrive at the same results.
Reproducibility crisis:
- According to a survey of ~1,500 researchers published in Nature in
2016, more than 70% of researchers had failed to reproduce someone
else’s work, and more than 50% had failed to reproduce their own.
- Scientific papers often don’t include enough information to allow
the results to be reproduced.
- This decreases the reliability of published literature and
causes distrust.
Cartoon “Scratch” from www.phdcomics.com
Reproducibility = obtaining the same results when using the same
code, data, and conditions of analysis.
- Reproducibility doesn’t always mean that the analysis or result is
correct! It only ensures that the research is transparent. It is the
minimum requirement for good data science, BUT it alone is not
sufficient.

Barriers to reproducible research
Bias:
- Publication bias
- Information bias
Replicability = Obtaining similar results across studies using
different data.
A replicable study can show that the original study was
reliable.
Together, reproducibility and replicability enhance the
reliability of results; they act as a quality check and reduce
bias.
Project organization
Sources:
+ Vince Buffalo: https://www.oreilly.com/library/view/bioinformatics-data-skills/9781449367480/
+ Jenny Bryan: https://speakerdeck.com/jennybc/how-to-name-files
+ Danielle Navarro: https://slides.djnavarro.net/project-structure
Why is organization important?
It makes things easier to find. This is done by defining a standard
and making the organization more predictable.
Best practices when organizing a project:
- Use folders and subdirectories liberally.
- Keep projects separate, and keep sections within a project
separate.
Essential folders/directories for any project
- README, Data, Analysis/Code/Scripts, Figures, Results
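This layout can be created in one step from a shell; the project name below is a placeholder:

```shell
# Create the essential directories for a new project
# ("my_project" is a placeholder name)
mkdir -p my_project/data my_project/analysis my_project/figures my_project/results
touch my_project/README.md
ls my_project
```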
README.md
- README files include essential information about a project that
another user should know about.
- This would be the first file anyone would read when trying to
understand your project.
- What a README file should include:
- Project title - make it informative.
- Project summary - include information about any specific
methods/techniques used.
- Project organization - note if files need to be read in a specific
order or format.
- Other things that could be included:
- More details! - always appreciated. Any other details that make
it easy for someone to read and use your project.
- Future directions?
- Any specific challenges faced?
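A minimal README.md skeleton covering the points above might look like the following; every project detail in it is a placeholder:

```shell
# Write a skeleton README.md; all details below are placeholders
cat > README.md <<'EOF'
# Informative Project Title

## Summary
What the project does, including any specific methods/techniques used.

## Organization
Folder layout, and whether files must be read in a specific order/format.

## Notes
Future directions, specific challenges faced, anything else helpful.
EOF
```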
Data Folder
- Only big files go here! These files are used repeatedly.
- Raw -> usually from external sources. Try not to modify these
files. To avoid modifications, you can change the file permissions to
make them unmodifiable.
- Separate subfolders for processed files.
- Use subdirectories by processing stage, date or sample.
It is often easiest to process all the files in a folder
together; organize by units of work.
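Removing write permission is one way to make raw files unmodifiable, as suggested above; the file name here is made up:

```shell
# Make a raw data file read-only so it can't be modified by accident
mkdir -p data/raw
echo "sample,reads" > data/raw/run1.csv   # placeholder raw file
chmod 644 data/raw/run1.csv               # start from standard permissions
chmod a-w data/raw/run1.csv               # remove write permission for everyone
ls -l data/raw/run1.csv                   # permissions now read as -r--r--r--
# to edit again later: chmod u+w data/raw/run1.csv
```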
NAMING FILES AND FOLDERS
Pick a file naming convention. It doesn’t have to be the same as mine
but make sure it meets some basic criteria listed below!
Make it informative. You should know what a file contains even
without opening it (preferably!).
Jenny Bryan’s standard:
Machine friendly:
- Avoid spaces
- Use underscores or dashes
RRBS_data_analysis.R
instead of
RRBS data analysis.R
- Use standard characters only!
- letters, numbers, underscores and dashes.
- Use periods only for file extensions.
- Avoid using special characters.
Differential_methylation.R
instead of
Differential.methylation.R
Be consistent with case!
- Don’t have two files with the same name that differ only in case.
- Case may or may not have meaning, depending on the programming
language and OS.
Globbing
- Use multiple chunks in file names, separated by underscores.
- This also allows the use of wildcards (*, and others) to
match all files that start or end with specific characters.
- Globbing lets you select files that match a “pattern”.
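The pattern matching described above can be tried directly in a shell; the file names below are made up:

```shell
# Create some example files whose names share a common prefix
mkdir -p glob_demo
touch glob_demo/RRBS_sample1_trimmed.fastq \
      glob_demo/RRBS_sample2_trimmed.fastq \
      glob_demo/WGBS_sample1_trimmed.fastq
# '*' matches any run of characters, so this selects only the RRBS files
ls glob_demo/RRBS_*.fastq
# '?' matches exactly one character
ls glob_demo/RRBS_sample?_trimmed.fastq
```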
- What’s even better than globbing? regex, or
regular expressions.
- Regex allows you to extract information from file names.
- File names can be easily parsed into a dataframe using R or any
other programming language.
- Makes things very easy for you and others!
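As a sketch, chunks of a structured file name can be extracted with regex capture groups; the naming pattern and fields here are made up:

```shell
# File name follows a made-up pattern: <assay>_<sample>_<date>.txt
name="RRBS_sample01_2024-05-14.txt"
pattern='^([A-Za-z]+)_([A-Za-z0-9]+)_([0-9-]+)\.txt$'
# sed -E with capture groups pulls out each chunk
assay=$(echo "$name"  | sed -E "s/$pattern/\1/")
sample=$(echo "$name" | sed -E "s/$pattern/\2/")
run_date=$(echo "$name" | sed -E "s/$pattern/\3/")
echo "$assay,$sample,$run_date"   # a CSV row you could read into a dataframe
```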
Human friendly:
- Use long, descriptive names (short names can be tempting) so you
can tell what each file is without opening it!
Don’t:
Analysis1.sh
Analysis2.sh
Do:
Alignment_Human_Genome.sh
Alignment_Mouse_Genome.sh
Sortable:
- Use numbers for sorting; left-pad with zeros (01, 02).
- Use ISO 8601 dates (year-month-day or YYYY-MM-DD), which sort
correctly.
Do:
01.mouse_adapter-trimming.sh
02.human_adapter-trimming.R
Don't:
10.trimmed.txt
1.stuff-a.csv
2.stuff-b.csv
Also:
Do:
2024-03-01_plasmid_sequence.txt
2024-04-27_cell-line_sequence.txt
Don't:
01-15-2024-backup.csv
22-05-2024_foo.R
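A quick shell demonstration of why left-padding matters, using file names from the Don't list above:

```shell
mkdir -p sort_demo
touch sort_demo/1.stuff-a.csv sort_demo/2.stuff-b.csv sort_demo/10.trimmed.txt
# Without zero-padding, 10 sorts between 1 and 2 in a byte-wise sort
LC_ALL=C ls sort_demo | LC_ALL=C sort
# With left-padded numbers the listing follows the intended order
touch sort_demo/01.one.csv sort_demo/02.two.csv
LC_ALL=C ls sort_demo | LC_ALL=C sort
```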
Computable:
- For files received from someone else, it is usually not advisable
to rename them, because the original names are easier to track in
conversation.
- When renaming: use scripts to track the changes made. Only rename
when absolutely necessary!!!
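When renaming is unavoidable, a tiny script can both perform the rename and record it in a log; all file names here are placeholders:

```shell
# Rename via a script so every change is recorded in a log file
mkdir -p rename_demo
echo "placeholder data" > rename_demo/smpl1.txt
old="rename_demo/smpl1.txt"
new="rename_demo/2024-05-14_sample1_raw.txt"
mv "$old" "$new"
echo "$(date +%Y-%m-%d): renamed $old -> $new" >> rename_demo/rename_log.txt
cat rename_demo/rename_log.txt
```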
Strategies for Data Sharing
- Data = all information needed to remake, interpret, or use any
figure, table, etc., i.e. code (versions, specific commands run, code
for preprocessing, post-processing, and visualization), metadata
(about the sequencing and all the samples), documentation, and data
(raw and processed).
- Coordinate with code and maintain records of all scripts, even if
the analysis is done by a core facility.
- Maintain both raw and processed data and deposit them at
repositories.
- Be mindful of patient privacy when working with human subjects’
data.
- Documentation: contextual information - a description of the data,
what each column name means, what NA means, the origin of the data,
and the software used to prepare it. This increases reliability.
- Using a data repository: unique identifiers which can be cited, has
policies for long-term retention, data sharing policies.
- Choosing repositories - generalist or specialist, raw or processed
data.
- Complementary sharing avenues - lab/university server, GitHub,
Zenodo (used with GitHub).
- Considerations:
- Accessible considerations - ?
- Interoperable - plain text files are far preferable to special
formats (Excel, MATLAB); use ontologies to obtain a shared vocabulary
with a hierarchical graph. E.g., “T-cells” is not descriptive enough -
use a specific ontology term (a specific type of T-cell).
- Reusable considerations
Make a plan/lab-wide policy and stick to it! Decide on naming,
organizing, and sharing data files, and how you will store/back up
data.
Organizing code in scripts and notebooks
- Be consistent with coding style - don't mix styles within a
script, e.g.:
data = read.csv("/path/to/file")
data=read.csv("/path/to/csv")
- Load all the packages up front at the top of the script instead of
having random chunks of code with `library()`.
- Comments are for the future you and for collaborators. Explain why
you are doing what you are doing!
- When updating code, remember to update the comments too.
- Set up and use R projects whenever applicable!
Managing packages and environments
- Changes occur at all levels - scripts, packages, individual
programs, OS, hardware!
- For the analysis layer - we are using Git and GitHub. Use “tags”
and “releases” on GitHub.
- For the package layer - changes in versions can affect the
analysis, and dependencies may require specific versions of packages.
Record versions with sessionInfo() in R or sessioninfo::session_info().
- Using the same versions of the packages:
- the renv R package - tracks, freezes, and shares R environments.
- Each project can have its own environment with its own set of
packages.
- renv doesn't use the system R package library - it creates a
library for each project, in an optimized way.
- renv creates an renv.lock file that describes the library.
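The “tags” mentioned for the analysis layer can be created locally with git (repository, identity, and file names below are placeholders); pushing a tag to GitHub then lets you turn it into a release:

```shell
# Mark an analysis milestone with an annotated git tag
git init -q tag_demo
git -C tag_demo config user.email "you@example.com"   # placeholder identity
git -C tag_demo config user.name "Demo User"
echo 'print("analysis")' > tag_demo/analysis.R
git -C tag_demo add analysis.R
git -C tag_demo commit -q -m "Freeze analysis for submission"
# An annotated tag records the exact state of the code at this point
git -C tag_demo tag -a v1.0 -m "Version used for the manuscript"
git -C tag_demo tag   # lists: v1.0
```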
Other references:
1. “Reproducibility and Replicability in Science.” - https://www.ncbi.nlm.nih.gov/books/NBK547546/#
2. GitHub cheat sheet
3. Git cheat sheet